About

We study how the observation-hours requirements that US-based physical therapy (PT) programs impose on prospective students vary with geographic location.

Dataset

The dataset was sourced from the PTCAS website (for requirements) & Google Maps (for program locations). It has the following shape:

dim(oh_data)
## [1] 240   8

meaning there are 240 programs & 8 columns. Previewing it:

head(oh_data, n=10)

Outlining the non-trivial columns:

Summary Statistics

Requirement Category

oh_data %>%
  ggplot() +
    geom_bar(aes(as.factor(Category))) +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    xlab('Requirement Category') +
    ylab('Number of Programs')

where the categories are:

categories

83% of programs require observation hours.
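
This figure can be recomputed directly from the data; a minimal sketch, assuming categories 1 & 2 are the requirement-imposing codes (the same coding used when conditioning distributions later):

mean(oh_data$Category %in% c(1, 2))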

Required Hours

summary(oh_data$Minimum.Hours)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0    40.0    50.0    55.3    80.0   300.0      30

The distribution of hours across all programs looks like:

oh_data %>%
  filter(!is.na(Minimum.Hours)) %>%
  ggplot() +
    geom_histogram(aes(Minimum.Hours), bins = 15)

Peeking at distributions conditioned on the requirement category (specifically 1 & 2, which do impose a requirement):

oh_data %>%
  filter(!is.na(Minimum.Hours) & Category %in% c(1,2)) %>%
  ggplot() +
    geom_violin(aes(as.factor(Category), Minimum.Hours))

The distributions are similar, with modes at 40 & 100 hours.

Incorporating Geography

Visualizing

Points

First, we view each program as a point on the US map, where color indicates the hours requirement:

us <- map_data('state')
usa <- map_data('usa') %>%
  filter(region == 'main')

gg <- oh_data %>%
  ggplot()  +
    geom_polygon(data = us, aes(x=long, y = lat, group = group), fill='grey', color='white') +
    geom_point(aes(x=Longitude, y=Latitude, color = Minimum.Hours.Imputed), size=10) + 
    scale_color_distiller(palette=1, direction=1) + 
    guides(color=guide_colorbar(title='Required Hours',
                               barwidth = 2, barheight = 20, 
                               title.theme=element_text(size=15), 
                               label.theme = element_text(size=20))) +
    ggtitle('PTCAS Program Required Observation Hours') +
    theme(plot.title = element_text(size=30),
          axis.line=element_blank(), axis.text=element_blank(),
          axis.ticks=element_blank(), axis.title=element_blank(),
          panel.background=element_blank(), panel.border=element_blank(),
          panel.grid=element_blank(), plot.background=element_blank())

gg

There is a heavy concentration of programs in the Northeast US, so we zoom in:

gg + coord_cartesian(xlim = c(-79, -67), ylim = c(35, 47))

Heatmap

To build a more interpretable visual, we need to:

  • Populate dead space by imputing expected number of required hours (modeling)
  • Capture trends
  • Capture pockets

RuleFit is a model fitting procedure that identifies hyperrectangles (2D rectangles, in our case) of “similar” observations in the data. The model is fit using LASSO regularization & cross validation, and the final model is selected using the cross-validation-error-minimizing regularization parameter; this is more aggressive (i.e. has a higher false positive rate) than using the regularization parameter within one standard error of the minimum.

train_data <- oh_data %>%
  filter(!is.na(Minimum.Hours.Imputed))
model <- xrf(Minimum.Hours.Imputed ~ Latitude + Longitude, 
             train_data, 
             family = 'gaussian',
             xgb_control = list(nrounds = 5, max_depth = 2),
             deoverlap = TRUE)
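
The fitted rules & their coefficients can then be inspected; a sketch, assuming xrf exposes a cv.glmnet-style coef method that accepts a lambda argument:

# each row is a rule (a conjunction of latitude/longitude splits) with its LASSO coefficient
coef(model, lambda = 'lambda.min')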

box <- us %>%
  summarize(
    min_lat = min(lat),
    max_lat = max(lat),
    min_long = min(long),
    max_long = max(long)
  )
points <- expand.grid(
  Latitude = seq(box$min_lat, box$max_lat, by = .1),
  Longitude = seq(box$min_long, box$max_long, by = .1),
  Minimum.Hours.Imputed = -1 # https://github.com/holub008/xrf/issues/9
) %>%
  filter(point.in.polygon(Longitude, Latitude, usa$long, usa$lat) > 0)

points$ehours <- predict(model, points, lambda = 'lambda.min')[,1]

ggplot(points) +
  geom_raster(aes(x = Longitude, y = Latitude, fill = ehours), interpolate = TRUE) +
  geom_polygon(data = us, aes(x=long, y = lat, group = group), fill=NA, color='black') + 
  scale_fill_distiller(palette=1, direction=1) + 
  guides(fill=guide_colorbar(title='Required Hours',
                             barwidth = 2, barheight = 20, 
                             title.theme=element_text(size=15), 
                             label.theme = element_text(size=20),
                             draw.ulim = FALSE, draw.llim = TRUE)) +
  ggtitle('PTCAS Program Required Observation Hours By Geography') +
  theme(plot.title = element_text(size=30),
        axis.line=element_blank(), axis.text=element_blank(),
        axis.ticks=element_blank(), axis.title=element_blank(),
        panel.background=element_blank(), panel.border=element_blank(),
        panel.grid=element_blank(), plot.background=element_blank())

There certainly appears to be a trend of decreasing hours moving east (as Longitude increases). There appear to be pockets of regionally high requirements in the LA area & the “central” east coast, while New England is a pocket of particularly low requirements.

Correlations

Main Effect Lat/Long

Visualizing

To visualize main effect correlations, we look at marginal x/y plots of latitude and longitude, with trend lines overlaid (a smoother for Latitude, a linear fit for Longitude).

oh_data %>%
  filter(!is.na(Minimum.Hours.Imputed)) %>%
  ggplot(aes(Latitude, log(Minimum.Hours.Imputed))) +
    geom_point() +
    geom_smooth()

Hours look pretty flat across Latitude, with perhaps a slight decrease moving North.

oh_data %>%
  filter(!is.na(Minimum.Hours.Imputed)) %>%
  ggplot(aes(Longitude, log(Minimum.Hours.Imputed))) +
    geom_point() +
    geom_smooth(method="lm")

There is a definite decrease in hours moving west to east, with a possible jump upwards at the coast.

Inference (Non-zero Effects)

To determine if the main effect correlations are non-zero, we perform a bootstrap on the correlation coefficients, building 95% intervals:

bs <- bootstrap(oh_data %>% filter(!is.na(Minimum.Hours)),
                corr_lat = cor(Latitude, Minimum.Hours, method='spearman'),
                corr_long = cor(Longitude, Minimum.Hours, method='spearman'))

summary.bootstrap(bs)
## Number of boostrap samples: 100 
## Empirical interval level: 0.95 
## 
##  statistic       mean standard_error   ci_lower    ci_upper
##   corr_lat -0.1490313     0.06970795 -0.2940977 -0.01282814
##  corr_long -0.1681075     0.07493308 -0.3022004 -0.03326307

Matching our visual intuition, the Longitude correlation interval clearly excludes 0 (i.e. a non-zero effect), while Latitude is less convincing (its interval only barely excludes 0). For future analyses, it is quite likely that these two effects interact (e.g. due to high requirements in the Northeast US).

Spatial Autocorrelation

Spatial autocorrelation is a measure of the extent to which any given observation depends on the values of other observations in the shared locality. Do required hours exhibit this property?

We should already feel confident that they do, since Longitude (a geographic component) correlates with required hours. However, we can answer more specific questions using different types of autocorrelation. In this section, we use Moran’s I to compute autocorrelation, varying the weighting matrix to answer specific questions.

# Moran's I: (n / sum(W)) * sum_ij[ W_ij * (x_i - xbar) * (x_j - xbar) ] / sum_i[ (x_i - xbar)^2 ]
# where W is a spatial weight matrix over the n observations
morani <- function(x, W) {
  n <- length(x)
  stopifnot(n == dim(W)[1] && n == dim(W)[2])
  
  squared_deviation_sum <- 0
  weighted_covariance_sum <- 0
  xbar <- mean(x)
  for (i in 1:n) {
    deviation_i <- x[i] - xbar
    for (j in 1:n) {
      weighted_covariance_sum <- weighted_covariance_sum + W[i, j] * deviation_i * (x[j] - xbar)
    }
    squared_deviation_sum <- squared_deviation_sum + deviation_i ^ 2
  }
  
  (n / sum(W)) * (weighted_covariance_sum / squared_deviation_sum)
}
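
As a quick sanity check on the implementation, consider a hypothetical 4-node path graph (each point adjacent to the next): values clustered by neighborhood should produce a positive I, while perfectly alternating values produce a negative one:

# adjacency (weight) matrix for the path 1-2-3-4
W_path <- rbind(c(0,1,0,0),
                c(1,0,1,0),
                c(0,1,0,1),
                c(0,0,1,0))
morani(c(1, 1, -1, -1), W_path)
## [1] 0.3333333
morani(c(1, -1, 1, -1), W_path)
## [1] -1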

Nearest Neighbors

Here we posit the question: do programs typically have similar hours to their nearest neighboring programs? We select a parameterization of k=10 neighboring programs: a set plausibly small enough that a program might consider it in its own determination, yet large enough to provide a signal. Distance between programs is the great-circle distance between them on the globe, computed with the Haversine formula.

haversine_distance <- function(lat1, lon1, lat2, lon2) {
  haversine(c(lat1, lon1), c(lat2, lon2))
}
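
A quick usage sketch (this assumes pracma’s haversine, which takes c(lat, lon) pairs in degrees & returns kilometers); the New York City & Boston coordinates are illustrative:

# NYC to Boston: roughly 300 km
haversine_distance(40.71, -74.01, 42.36, -71.06)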

# points must be a dataframe with Latitude and Longitude columns
# returns distances in KM exponentiated by distance_exp
compute_pairwise_distances <- function(points, distance_exp = 1) {
  distances <- matrix(0, nrow=nrow(points), ncol=nrow(points))
  for (i in 1:nrow(points)) {
    for (j in i:nrow(points)) {
      dist <- haversine_distance(points[i, 'Latitude'], points[i, 'Longitude'],
                                points[j, 'Latitude'], points[j, 'Longitude']) ^ distance_exp
      distances[i, j] <- dist
      distances[j, i] <- dist
    }
  }
  
  distances
}

compute_nearest_neighbors <- function(d, k=5) {
  stopifnot(dim(d)[1] == dim(d)[2])
  stopifnot(dim(d)[1] > k)
  
  neighbor_graph <- matrix(0, nrow=nrow(d), ncol=nrow(d))

  for (i in 1:nrow(d)) {
    kth_distance <- sort(d[i,])[k + 1]
    nn_ix <- which(d[i,] <= kth_distance)
    nn_ix <- nn_ix[nn_ix != i]
    neighbor_graph[i, nn_ix] <- 1
  }
  
  neighbor_graph
}

relevant_oh_data <- oh_data %>%
  filter(!is.na(Minimum.Hours.Imputed))

program_distances <- compute_pairwise_distances(relevant_oh_data)
nn <- compute_nearest_neighbors(program_distances, 10)
(moran_observed <- morani(relevant_oh_data$Minimum.Hours.Imputed, nn))
## [1] 0.1390841

Moran’s I is calculated as .139; at n = 216 programs, its expectation under the null is -1/(n-1) ≈ -.005, i.e. nearly 0. The observed value, which suggests positive autocorrelation (programs require hours similar to the programs around them), is quite a bit larger, but how do we assess significance? Monte Carlo simulation will be used to build a null distribution of Moran’s I (i.e. the empirical distribution when no autocorrelation exists):

# Spatial Monte Carlo: resample values (with replacement) across points and recompute Moran's I to build a null distribution
spatial_mc <- function(data, trials = 1000) {
  program_distances <- compute_pairwise_distances(data)
  nn <- compute_nearest_neighbors(program_distances, 10)
  
  empirical_distribution <- c()
  for (trial_ix in 1:trials) {
    resampled_outcome <- sample(data$Minimum.Hours.Imputed, nrow(data), replace=TRUE)
    moran_resampled <- morani(resampled_outcome, nn)
    empirical_distribution <- c(empirical_distribution, moran_resampled)
  }
  
  empirical_distribution
}

moran_null <- spatial_mc(relevant_oh_data)
data.frame(moran_i = moran_null) %>%
ggplot() +
  geom_histogram(aes(moran_i)) +
  geom_vline(xintercept = moran_observed) + 
  xlab("Moran's I") +
  ggtitle("Spatial Autocorrelation: Monte Carlo Null Distribution vs. Observed")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

1 - sum(moran_null < moran_observed) / length(moran_null)
## [1] 0

Moran’s I shows high significance, with p < .001.

To better visualize this correlation, here is a plot of program observation hours (x-axis) against the mean observation hours of the 10 neighboring programs (y-axis):

nn_mean <- c()
for (ix in 1:nrow(nn)) {
  neighbor_ixs <- which(nn[ix,] > 0)
  nn_mean <- c(nn_mean, mean(relevant_oh_data[neighbor_ixs, 'Minimum.Hours.Imputed']))
}
relevant_oh_data$nn_mean <- nn_mean
ggplot(relevant_oh_data, aes(Minimum.Hours.Imputed, nn_mean)) +
  geom_jitter() +
  geom_smooth(method='lm')

Notice that the linear model visually understates the effect, since there are several high leverage points near the max extent of required hours; Moran’s I is robust to these points. Nonetheless, the trend is visually apparent, particularly when the x axis is log-transformed:

ggplot(relevant_oh_data, aes(log(Minimum.Hours.Imputed), nn_mean)) +
  geom_jitter() +
  geom_smooth(method='lm')

Haversine Distance (great-circle distance)

Haversine distance can be used to demonstrate the effect of continuous distance on required hours. The subtle difference from nearest neighbors is that continuous distance considers all programs, not just the closest k.

# invert, so that nearer programs have higher weight
haversine_weights <- 1 / program_distances
haversine_weights[is.infinite(haversine_weights)] <- 0
(haversine_moran <- morani(relevant_oh_data$Minimum.Hours.Imputed, haversine_weights))
## [1] 0.2103927

And significance testing:

spatial_continuous_mc <- function(data, trials = 1000) {
  # invert pairwise distances so that nearer programs carry higher weight
  haversine_weights <- 1 / compute_pairwise_distances(data)
  haversine_weights[is.infinite(haversine_weights)] <- 0
  
  empirical_distribution <- c()
  for (trial_ix in 1:trials) {
    resampled_outcome <- sample(data$Minimum.Hours.Imputed, nrow(data), replace=TRUE)
    moran_resampled <- morani(resampled_outcome, haversine_weights)
    empirical_distribution <- c(empirical_distribution, moran_resampled)
  }
  
  empirical_distribution
}

haversine_null <- spatial_continuous_mc(relevant_oh_data)
data.frame(moran_i = haversine_null) %>%
ggplot() +
  geom_histogram(aes(moran_i)) +
  geom_vline(xintercept = haversine_moran) + 
  xlab("Moran's I") +
  ggtitle("Spatial Autocorrelation: Monte Carlo Null Distribution vs. Observed")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

1 - sum(haversine_null < haversine_moran) / length(haversine_null)
## [1] 0.012

The autocorrelation remains significant (p=.012).

Visualizing each program’s required hours against the distance-weighted mean of observation hours:

continuous_means <- c()
for (ix in 1:nrow(relevant_oh_data)) {
  # weighted mean of the other programs' hours, weighting nearer programs more heavily
  w <- haversine_weights[ix,]
  continuous_means <- c(continuous_means, sum(w * relevant_oh_data$Minimum.Hours.Imputed) / sum(w))
}

relevant_oh_data$continuous_mean <- continuous_means
ggplot(relevant_oh_data, aes(Minimum.Hours.Imputed, continuous_mean)) +
  geom_jitter() +
  geom_smooth(method='lm') +
  ylab('Distance weighted required hours')

Interpretation of this visual is less intuitive.

More importantly, this test does not correspond to as plausible a hypothesis as nearest neighbors: it supposes programs weigh all other programs by distance, rather than looking at a discrete set of nearby programs to determine their own hours (see the Nearest Neighbors section).

Program Density Estimation

Is the required number of hours tied to geographic program density? One hypothesis is that increased program density correlates with decreased availability of student openings in that locale, and hence lower required hours.

density_estimate_grid <- kde2d(oh_data$Longitude, oh_data$Latitude, n=100)

gr <- with(density_estimate_grid, data.frame(expand.grid(x,y), as.vector(z)))
names(gr) <- c("Longitude", "Latitude", "z")
density_estimate <- loess(z~Longitude*Latitude, data=gr)

oh_data$density_estimate <- predict(density_estimate, oh_data)

ggplot(oh_data, aes(density_estimate, log(Minimum.Hours.Imputed))) +
  geom_point() +
  geom_smooth(method='lm')
## Warning: Removed 24 rows containing non-finite values (stat_smooth).
## Warning: Removed 24 rows containing missing values (geom_point).

cor.test(oh_data$Minimum.Hours.Imputed, oh_data$density_estimate, 
         use="complete.obs", alternative = "less", method="spearman")
## Warning in cor.test.default(oh_data$Minimum.Hours.Imputed,
## oh_data$density_estimate, : Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  oh_data$Minimum.Hours.Imputed and oh_data$density_estimate
## S = 1850660, p-value = 0.06782
## alternative hypothesis: true rho is less than 0
## sample estimates:
##        rho 
## -0.1018586

The trend is possibly there (marginal significance, p=.068), and the scatter plot shows that the effect size is small (< 5 hours of variation across the full range of program density values).